ABSTRACT

The  estimation  of  the  parameters  of  a  linear  statistical 
model  is  generally  accomplished  by  the  method  of  least 
squares.  However,  when  the  method  of  least  squares  is 
applied  to  nonorthogonal  problems  the  resulting  estimates 
may  be  significantly  different  from  the  true  parameters. 

The  method  of  ridge  regression  may  provide  better  estimates 
in  these  cases;  however,  a  probability  distribution  of  the 
ridge  estimator  is  presently  not  known.  The  form  of  such  a 
distribution  is  dependent  upon  how  the  ridge  parameter,  k, 
is  selected.  Two  possible  objective  methods  of  choosing  k 
are  examined  to  determine  if  either  one  leads  to  a  useful 
probability  distribution. 
I.  BACKGROUND


The  following  conventions  will  be  used  throughout. 

Unless  otherwise  noted,  capital  letters  and  Greek  letters 
will  refer  to  matrices  and  vectors  while  lower  case  letters 
will  refer  to  scalars. 

A.  INTRODUCTION 

The use of linear statistical models is widespread in
scientific fields of all kinds. Generally, the linear
statistical model is postulated as

$$ Y = X\beta + \varepsilon \qquad (1) $$

where Y is an n x 1 vector of n observed values of a
dependent variable, X is an n x p matrix containing n
values for each of p predictor (independent) variables,
$\beta$ is a p x 1 vector of p unknown parameters (or coefficients)
to be estimated from data, and $\varepsilon$ is an n x 1 vector
representing experimental errors. Usually, the experimental
error is assumed to have a multivariate normal distribution
with mean equal to zero and variance-covariance matrix
equal to $\sigma^2 I$, where $\sigma^2$ is the scalar value of the common
variance of the experimental errors. This assumption
will be made throughout this paper.

In practice, the modeling problem is to estimate the
parameters $\beta$ from data Y and X. The most common method of
doing this is called least squares estimation or sometimes
ordinary least squares (OLS). The latter designation
will be used in this paper.

Under certain fairly general and common conditions
OLS is an adequate method of estimating $\beta$. However, when
the data are "ill-conditioned" or nonorthogonal, OLS may
yield poor estimates of the true parameters.

Ridge regression (RR) has been proposed [Ref. 1] as an
alternative estimation method that might yield better
estimates under conditions where OLS does poorly.


B.  ORDINARY  LEAST  SQUARES 

For convenience, it is assumed that the elements of X
are scaled such that X'X has the form of a correlation
matrix. This is done by forming from each element $x_{ij}$ a
new element $x'_{ij}$ such that

$$ x'_{ij} = (x_{ij} - \bar{x}_j)/s_j \qquad (2) $$

where $\bar{x}_j$ is the mean value of the elements of the j-th
independent variable and $s_j$ is its standard deviation
times an appropriate constant such that the diagonal
elements of X'X are equal to one. The OLS estimator of
$\beta$ is then

$$ \hat{\beta} = (X'X)^{-1} X'Y \qquad (3) $$

so long as $(X'X)^{-1}$ exists.^1 The estimator $\hat{\beta}$ is unique,
unbiased and is the best linear unbiased estimator (BLUE)
of $\beta$ (it has the minimum variance among all linear
unbiased estimators of $\beta$) so long as $E(Y) = X\beta$ and
$E[(Y - X\beta)(Y - X\beta)'] = \sigma^2 I$ where $\sigma^2$ is a scalar, as assumed
previously.

The OLS estimator $\hat{\beta}$ is commonly used and is particularly
useful when it can be assumed that Y is a multivariate
normal vector with mean vector $X\beta$ and covariance matrix
$\sigma^2 I$. In this case, it can be shown^2 that the maximum
likelihood estimator of $\beta$ is the same as the OLS estimator
and furthermore, since $\hat{\beta}$ is a linear function of the elements
of Y, $\hat{\beta}$ has a multivariate normal distribution with mean
vector equal to $\beta$ and covariance matrix $\sigma^2 (X'X)^{-1}$. This
latter characteristic of $\hat{\beta}$ allows the use of hypothesis
tests and the computation of confidence bounds.
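The scaling in equation (2) and the estimator in equation (3) can be illustrated numerically. The following is a minimal sketch, assuming only two predictors so that X'X is the 2 x 2 matrix [[1, r], [r, 1]] and can be inverted by Cramer's rule. The data values and the function names (`correlation_scale`, `cross`, `ols_two_predictors`) are hypothetical and are not taken from the paper.

```python
import math

def correlation_scale(columns):
    """Center each predictor and scale it to unit sum of squares, as in equation (2)."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        centered = [v - mean for v in col]
        s = math.sqrt(sum(v * v for v in centered))  # std. deviation times sqrt(n - 1)
        scaled.append([v / s for v in centered])
    return scaled

def cross(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def ols_two_predictors(x1, x2, y):
    """OLS estimate (X'X)^{-1} X'Y for two scaled predictors, equation (3)."""
    r = cross(x1, x2)                    # off-diagonal element of X'X
    g1, g2 = cross(x1, y), cross(x2, y)  # the two elements of X'Y
    det = 1.0 - r * r                    # determinant of [[1, r], [r, 1]]
    return [(g1 - r * g2) / det, (g2 - r * g1) / det]

# Hypothetical illustrative data with two strongly correlated predictors.
raw_x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
raw_x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
y = [2.0, 4.1, 6.2, 7.9, 10.1]

x1, x2 = correlation_scale([raw_x1, raw_x2])
beta_hat = ols_two_predictors(x1, x2, y)
print("correlation r =", cross(x1, x2))
print("OLS estimate  =", beta_hat)
```

After scaling, the diagonal of X'X equals one and the off-diagonal element is the sample correlation between the two predictors, which is the form assumed throughout the text.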

Unfortunately,  in  some  cases  X'X  is  "ill-conditioned" 
and  OLS  yields  poor  estimates.  This  typically  occurs  when 
an  experiment  is  poorly  designed  or  there  are  economic  or 
physical  restraints  causing  strong  correlations  among  the 
predictor  variables.  In  this  case  X'X,  in  its  correlation 
matrix  form,  will  not  be  orthogonal. 


^1 For a derivation and details of properties of the OLS
estimator, see, for example, Ref. 2.

^2 For example, see Ref. 2, page 182.
Hoerl and Kennard [Ref. 3] address the eigenvalues of
X'X (denoted by $\lambda_j$, j = 1, 2, ..., p) and point out that
nonorthogonal data are characterized by the smallest
eigenvalue being much less than unity and that, since
$\sigma^2/\lambda_{\min}$ is a lower bound for the mean squared distance
between $\hat{\beta}$ and $\beta$, then for X'X nonorthogonal, the difference
between $\hat{\beta}$ and $\beta$ has a high probability of being large.
When X'X is nonorthogonal $\hat{\beta}$ is characterized by one or more
of the following difficulties, for example:

(1) large variance,

(2) large magnitude of residual errors,

(3) incorrect signs of parameter estimates.
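The effect of a small eigenvalue can be seen in a short numerical illustration. The sketch below assumes two predictors, for which the eigenvalues of the correlation-form X'X are 1 + |r| and 1 - |r|, and uses the standard result that the expected squared distance between the OLS estimate and the true parameters equals $\sigma^2$ times the sum of the reciprocal eigenvalues. The function name and the chosen values of r are hypothetical.

```python
def expected_squared_distance(r, sigma2=1.0):
    """Return (sigma^2 * sum(1/lambda_j), sigma^2 / lambda_min) for X'X = [[1, r], [r, 1]]."""
    eigenvalues = [1.0 + abs(r), 1.0 - abs(r)]  # eigenvalues of a 2 x 2 correlation matrix
    total = sigma2 * sum(1.0 / lam for lam in eigenvalues)
    lower_bound = sigma2 / min(eigenvalues)
    return total, lower_bound

for r in (0.0, 0.9, 0.99):
    total, bound = expected_squared_distance(r)
    print(f"r={r:4.2f}  expected squared distance={total:8.2f}  sigma^2/lambda_min={bound:8.2f}")
```

As r approaches one the smallest eigenvalue approaches zero, and the lower bound $\sigma^2/\lambda_{\min}$ accounts for nearly all of the expected squared distance.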

C.  RIDGE  REGRESSION 

A. E. Hoerl suggested [Refs. 1 and 4] that the large
variance of $\hat{\beta}$ for nonorthogonal data could be reduced by
the addition of a constant $k \ge 0$ to the diagonal elements of
X'X, thus yielding

$$ \hat{\beta}^* = (X'X + kI)^{-1} X'Y \qquad (4) $$

as an estimator. Equation (4) is derived in Appendix A.

Note that for k equal to zero the estimator $\hat{\beta}^*$ is equal
to the OLS estimator $\hat{\beta}$. Therefore, OLS can be thought of
as a special case of ridge regression.^3 Hoerl suggested
the name "ridge regression" for this procedure because of
its mathematical similarity to some of his earlier work
[Ref. 5] on quadratic response functions.

^3 See Appendix B for a discussion of an even more
general estimator.
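Equation (4) can be evaluated directly for a small problem. The following is a minimal sketch, assuming two predictors in correlation form so that X'X + kI is [[1 + k, r], [r, 1 + k]]. The function name `ridge_two_predictors` and the numerical inputs (r and the cross-products X'Y) are hypothetical.

```python
def ridge_two_predictors(r, g, k):
    """Solve (X'X + kI) b = X'Y for X'X = [[1, r], [r, 1]] and X'Y = g, equation (4)."""
    d = 1.0 + k              # diagonal element of X'X + kI
    det = d * d - r * r      # determinant of the 2 x 2 system
    return [(d * g[0] - r * g[1]) / det, (d * g[1] - r * g[0]) / det]

# Hypothetical inputs: predictor correlation r and cross-products X'Y.
r, g = 0.95, [0.9, 0.8]
beta_ols = ridge_two_predictors(r, g, 0.0)    # k = 0 reproduces the OLS estimate
beta_ridge = ridge_two_predictors(r, g, 0.1)  # a small positive k

print("k = 0.0:", beta_ols)
print("k = 0.1:", beta_ridge)
```

With these inputs the second OLS coefficient is negative while the corresponding ridge coefficient at k = 0.1 is positive, and the ridge vector is considerably shorter than the OLS vector.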

1.  Mean Squared Error

The rationale behind using the ridge estimator is
to minimize the mean squared error (MSE) associated with
the estimate instead of minimizing the sum of squares of
residuals as is done in OLS.^4 Hoerl and Kennard show
that the mean squared error is given by

$$ \mathrm{MSE} = \mathrm{Variance} + (\mathrm{Bias})^2 \qquad (5) $$

Furthermore, they show that variance is a monotonically
decreasing function of k, that the squared bias is a
monotonically increasing function of k and that the rate
of change of variance, for nonorthogonal data and small k,
is considerably larger than the rate of change of the
squared bias. Figure 1 is a graphical illustration of
these relationships. Hoerl and Kennard argue that it is
possible to find some k > 0 such that the variance is
greatly reduced while only a small amount of bias is
introduced, thus yielding a smaller MSE than if OLS (k = 0)
were used. Indeed they show that if $\beta'\beta$ is bounded, then
such a k always exists. Thus, proper use of ridge
regression on nonorthogonal data ensures a reduced MSE of
estimation.

^4 In the case of unbiased estimation, which OLS is,
these are equivalent criteria.
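The trade-off in equation (5) can be tabulated for a small case. The sketch below assumes the canonical-form expressions usually attributed to Hoerl and Kennard, which are not reproduced in this paper: the total variance is $\sigma^2 \sum_j \lambda_j/(\lambda_j + k)^2$ and the squared bias is $k^2 \sum_j \alpha_j^2/(\lambda_j + k)^2$, where $\lambda_j$ are the eigenvalues of X'X and $\alpha_j$ are the true coefficients expressed in the eigenvector basis. The eigenvalues, coefficients, and function name are hypothetical.

```python
def ridge_mse_terms(eigenvalues, alpha, sigma2, k):
    """Return (total variance, squared bias) of the ridge estimator in canonical form."""
    variance = sigma2 * sum(lam / (lam + k) ** 2 for lam in eigenvalues)
    bias_sq = k * k * sum(a * a / (lam + k) ** 2 for lam, a in zip(eigenvalues, alpha))
    return variance, bias_sq

# Hypothetical nonorthogonal case: eigenvalues of [[1, 0.95], [0.95, 1]].
eigenvalues = [1.95, 0.05]
alpha = [1.0, 0.5]   # assumed true coefficients in the eigenvector basis
sigma2 = 1.0

table = []
for k in (0.0, 0.05, 0.1, 0.5, 1.0):
    variance, bias_sq = ridge_mse_terms(eigenvalues, alpha, sigma2, k)
    table.append((k, variance, bias_sq, variance + bias_sq))
    print(f"k={k:4.2f}  variance={variance:7.3f}  bias^2={bias_sq:6.3f}  MSE={variance + bias_sq:7.3f}")
```

In this example the variance falls quickly for small k while the squared bias grows slowly, so the MSE at k = 0.1 is well below the MSE at k = 0.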

The problem remains to select an appropriate
value of k. Hoerl and Kennard [Ref. 6] suggest the use of
two graphical devices as aids to determining an appropriate
value of k. The first is the ridge trace, a two-dimensional
plot of the elements of $\hat{\beta}^*$ as functions of k and the second
is an estimate of the squared length of the coefficient
vector, $\hat{\beta}^{*\prime}\hat{\beta}^*$. The ridge trace is used to gain an
understanding of the underlying correlations between the various
predictor variables while the plot of $\hat{\beta}^{*\prime}\hat{\beta}^*$ is used to
subjectively determine a suitable range of values of k.
A typical ridge trace is illustrated in Figure 2 and a
typical plot of $\hat{\beta}^{*\prime}\hat{\beta}^*$ is depicted in Figure 3. Notice
that $\hat{\beta}^{*\prime}\hat{\beta}^*$, in Figure 3, decreases steeply for small k
(k < 0.2) but in the range about 0.3 to 0.4 has become
much less sensitive to further increases in k.
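The two graphical devices described above amount to evaluating the ridge estimator over a grid of k values. The following is a minimal sketch, assuming two predictors in correlation form. The inputs r and X'Y, the grid spacing, and the function name are hypothetical, and the numbers are not those underlying Figures 2 and 3.

```python
def ridge_two_predictors(r, g, k):
    """Solve (X'X + kI) b = X'Y for X'X = [[1, r], [r, 1]] and X'Y = g."""
    d = 1.0 + k
    det = d * d - r * r
    return [(d * g[0] - r * g[1]) / det, (d * g[1] - r * g[0]) / det]

r, g = 0.95, [0.9, 0.8]   # hypothetical correlation and cross-products

trace = []                # rows of (k, b1, b2, squared length)
for j in range(0, 11):
    k = j / 10
    b = ridge_two_predictors(r, g, k)
    trace.append((k, b[0], b[1], b[0] ** 2 + b[1] ** 2))

for k, b1, b2, length_sq in trace:
    print(f"k={k:3.1f}  b1={b1:7.4f}  b2={b2:7.4f}  squared length={length_sq:7.4f}")
```

Plotting the second and third columns against k gives a ridge trace, and plotting the last column gives the squared-length curve, which decreases as k increases.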

2.  Alternative Methods of Choosing k

The previously described method of subjectively
choosing a suitable value of k is the current method in
use and appears to be useful. A major problem arises,
however, because the method denies to the analyst
knowledge of the probability distribution of $\hat{\beta}^*$ and, therefore,
any probabilistic inferences concerning the resulting
estimator. Hoerl and Kennard have suggested a general form
of ridge regression [Ref. 3] and an iterative method of
determining k. In addition, Hemmerle [Ref. 7] has derived
a closed form solution based on this method. Another
possibility is to use the ridge trace or the plot of
$\hat{\beta}^{*\prime}\hat{\beta}^*$ quantitatively to calculate a point value for k in
such a way that the marginal probability distribution, $f_{\hat{\beta}^*}$,
may be determined. Two such methods using the ridge trace
are examined in the next section.

[FIGURE 3. Typical plot of the squared length of the ridge estimator, $\hat{\beta}^{*\prime}\hat{\beta}^*$]
II.  PROPOSED OBJECTIVE RULES FOR CHOOSING k


The  slope  (rate  of  change)  of  the  ridge  trace  curves 
or  the  absolute  change  of  the  ridge  trace  curves  over  a 
specified  interval  may  be  used  to  determine  a  value  of  the 
ridge  parameter,  k,  objectively.  These  criteria  are 
discussed  here. 

Either of these criteria may be sensitive to the
behavior of each coefficient $\hat{\beta}^*_i$. In general, $\hat{\beta}^*_i$ is not
monotonic in k, although every $\hat{\beta}^*_i$ approaches zero as k is
increased without bound. It has been noted by Marquardt
and Snee [Ref. 8] that it is not uncommon for one or more
$\hat{\beta}^*_i$ to increase in absolute value as k is increased. (See,
for example, the ridge trace in Figure 2.) Therefore, the ridge trace
should be examined by the analyst to detect any behavior
of $\hat{\beta}^*_i$ that might adversely affect the proper selection of k
even though the ridge trace is not to be used directly to
select a specific value of k.

It is clear that $\hat{\beta}^*$ is distributed multivariate normal
if Y is distributed multivariate normal and a specific
value of k is selected a priori. However, whenever the
value of k is dependent on a data sample its value will
not generally be the same for each data sample. Therefore,
k is a random variable. Let K denote this (scalar) random
variable.
The marginal probability distribution of $\hat{\beta}^*$ may be
derived from the joint probability distribution of K and
$\hat{\beta}^*$ which can be determined by

$$ f_{\hat{\beta}^*, K} = f_{\hat{\beta}^* | K} \cdot f_K \qquad (6) $$

if the conditional distribution of $\hat{\beta}^*$ given K, $f_{\hat{\beta}^* | K}$, and
the marginal distribution of K, $f_K$, are known. As stated
above, when K is given, the distribution of $\hat{\beta}^*$ is known.

It remains to determine the marginal of K, $f_K$. Clearly,
this distribution depends on how K is related to Y. The
procedure will be to find a mapping from the range of Y
into the range of K which gives the marginal distribution of
K. With this distribution and the known conditional
distribution of $\hat{\beta}^*$ given K, the joint distribution of $\hat{\beta}^*$ and K may
be determined. It is convenient to consider the cumulative
distribution function, $F_K(k)$, since, if the functional
relationship of K to Y, K = h(Y), is known then

$$ F_K(k) = P[K \le k] = P[h(Y) \le k] = P[Y \in R_k] \qquad (7) $$

where $R_k$ is a region in the space of Y corresponding to
$h(Y) \le k$. Thus if $R_k$ can be determined then, since the
marginal distribution of Y is known, $F_K(k) = P[Y \in R_k]$ can
be determined and $f_K$ may be determined from $F_K$ by
differentiation. It remains to determine $R_k$ corresponding
to a specified region in the space of K and an objective
rule for mapping from Y to K.
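Although equation (7) is stated analytically, $F_K(k)$ can also be approximated by simulation once a rule K = h(Y) is fixed. The sketch below is only an illustration under stated assumptions: the rule `rule_h` is a hypothetical placeholder (the smallest grid value of k in [0, 1] at which the squared length of the ridge estimate falls to a cap, or 1 if no grid value qualifies) and is not one of the rules proposed in this paper. The design columns, true coefficients, error standard deviation, and number of draws are likewise hypothetical.

```python
import math
import random

def ridge(r, g, k):
    """Solve (X'X + kI) b = X'Y for X'X = [[1, r], [r, 1]] and X'Y = g."""
    d = 1.0 + k
    det = d * d - r * r
    return [(d * g[0] - r * g[1]) / det, (d * g[1] - r * g[0]) / det]

def rule_h(r, g, cap=1.0, steps=100):
    """Hypothetical rule K = h(Y): smallest grid k in [0, 1] with squared length <= cap."""
    for j in range(steps + 1):
        k = j / steps
        b = ridge(r, g, k)
        if b[0] ** 2 + b[1] ** 2 <= cap:
            return k
    return 1.0   # no grid value met the cap; report the upper end of the range

def empirical_cdf_of_k(x1, x2, beta, sigma, n_draws, seed=0):
    """Monte Carlo estimate of F_K(k) = P[h(Y) <= k] under normal errors."""
    rng = random.Random(seed)
    r = sum(a * b for a, b in zip(x1, x2))
    draws = []
    for _ in range(n_draws):
        y = [beta[0] * a + beta[1] * b + rng.gauss(0.0, sigma) for a, b in zip(x1, x2)]
        g = [sum(a * v for a, v in zip(x1, y)), sum(b * v for b, v in zip(x2, y))]
        draws.append(rule_h(r, g))
    return lambda k: sum(1 for d in draws if d <= k) / len(draws)

# Hypothetical design: two centered, unit-length columns with correlation 0.95.
s = math.sqrt(1.0 - 0.95 ** 2)
x1 = [0.5, 0.5, -0.5, -0.5]
u = [0.5, -0.5, 0.5, -0.5]
x2 = [0.95 * a + s * b for a, b in zip(x1, u)]

F = empirical_cdf_of_k(x1, x2, beta=[1.0, 0.5], sigma=0.5, n_draws=500)
for k in (0.0, 0.25, 0.5, 1.0):
    print(f"estimated F_K({k:4.2f}) = {F(k):.3f}")
```

A simulation of this kind gives only a numerical approximation for one design and one set of assumed parameters; it does not provide the closed-form distribution sought in the text.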

A.  ABSOLUTE  VALUE  CRITERION 

The practical range of the ridge parameter is taken to
be $0 < k \le 1$ in the literature. It seems reasonable then
to choose the smallest value of k such that all $\hat{\beta}^*_i(k)$ are
close to their respective values at k = 1. In other words,

$$ |\hat{\beta}^*_i(k) - \hat{\beta}^*_i(1)| \le \delta_i; \quad i = 1, 2, \ldots, p \qquad (8) $$

where $\delta_i$ is a constant selected by the analyst. The
criterion expressed by (8) means that the ridge trace curves,
$\hat{\beta}^*_i$, at k are within $\delta_i$ of their value at k = 1, beyond which
there is no interest. Here $\delta_i$ refers to the i-th scalar
component of a p x 1 vector, $\delta$. Suppose that at some
$k = k_0$ the m-th component of the left hand side of (8) is
the one whose absolute magnitude is largest. Define a
p x 1 vector $\tau$ such that $\tau_m = \pm\delta_m$, as appropriate, and the
other components of $\tau$ are equal, with the appropriate signs,
to the corresponding values of
$|\hat{\beta}^*_i(k_0) - \hat{\beta}^*_i(1)|$. Then equation (8) can be rewritten
in vector form

$$ \hat{\beta}^*(k_0) - \hat{\beta}^*(1) = \tau \qquad (9) $$
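The absolute value criterion in equation (8) can be applied numerically by scanning a grid of k values. The following is a minimal sketch, assuming two predictors in correlation form. The grid resolution, the tolerances in `delta`, the inputs r and X'Y, and the function names are hypothetical choices made only for illustration.

```python
def ridge_two_predictors(r, g, k):
    """Solve (X'X + kI) b = X'Y for X'X = [[1, r], [r, 1]] and X'Y = g."""
    d = 1.0 + k
    det = d * d - r * r
    return [(d * g[0] - r * g[1]) / det, (d * g[1] - r * g[0]) / det]

def absolute_value_rule(r, g, delta, steps=1000):
    """Smallest grid k in (0, 1] with |b_i(k) - b_i(1)| <= delta_i for every i."""
    b_one = ridge_two_predictors(r, g, 1.0)
    for j in range(1, steps + 1):
        k = j / steps
        b = ridge_two_predictors(r, g, k)
        if all(abs(bi - b1) <= d for bi, b1, d in zip(b, b_one, delta)):
            return k
    return 1.0   # safety fallback; k = 1 always satisfies the criterion

r, g = 0.95, [0.9, 0.8]   # hypothetical correlation and cross-products
delta = [0.1, 0.1]        # hypothetical tolerances chosen by the analyst
k_star = absolute_value_rule(r, g, delta)
print("selected k =", k_star)
print("ridge estimate at selected k =", ridge_two_predictors(r, g, k_star))
```

Because the selected value depends on X'Y, and therefore on Y, repeating the calculation on a new sample generally gives a different k, which is why K is treated as a random variable.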
B.  DERIVATIVE CRITERION

Another potential criterion to use for selecting k is
to require that the slopes of all $\hat{\beta}^*_i$ be "flat enough" in
the sense that

$$ \left| \frac{\partial \hat{\beta}^*_i(k)}{\partial k} \right| \le \delta_i; \quad i = 1, 2, \ldots, p \qquad (10) $$

where $\delta_i$ is as previously defined. Define m such that the
m-th component of the left hand side of (10) is the one
whose absolute magnitude is largest and define a p x 1
vector $\pi$ such that $\pi_m = \pm\delta_m$, as appropriate, and the other
components of $\pi$ are equal, with the appropriate signs, to the
corresponding values of $|\partial \hat{\beta}^*_i(k)/\partial k|$.
Then equation (10) can be written, in vector form,

$$ \frac{\partial \hat{\beta}^*(k)}{\partial k} = \pi \qquad (11) $$
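The derivative criterion can be sketched in the same way. The example below assumes two predictors in correlation form and uses the fact that differentiating equation (4) with respect to k gives $-(X'X + kI)^{-2} X'Y$, that is, minus $(X'X + kI)^{-1}$ applied to the ridge estimate itself. The grid, the tolerances, the inputs, and the function names are hypothetical.

```python
def solve_shifted(r, rhs, k):
    """Solve (X'X + kI) x = rhs for X'X = [[1, r], [r, 1]]."""
    d = 1.0 + k
    det = d * d - r * r
    return [(d * rhs[0] - r * rhs[1]) / det, (d * rhs[1] - r * rhs[0]) / det]

def ridge_slope(r, g, k):
    """Derivative of the ridge estimate with respect to k: -(X'X + kI)^{-2} X'Y."""
    b = solve_shifted(r, g, k)   # ridge estimate at k
    s = solve_shifted(r, b, k)   # one more application of the inverse
    return [-s[0], -s[1]]

def derivative_rule(r, g, delta, steps=1000):
    """Smallest grid k in [0, 1] at which every |slope_i| <= delta_i, or None."""
    for j in range(steps + 1):
        k = j / steps
        if all(abs(s) <= d for s, d in zip(ridge_slope(r, g, k), delta)):
            return k
    return None   # the trace never becomes flat enough on [0, 1]

r, g = 0.95, [0.9, 0.8]   # hypothetical correlation and cross-products
delta = [0.5, 0.5]        # hypothetical slope tolerances
k_star = derivative_rule(r, g, delta)
print("selected k =", k_star)
```

Unlike the absolute value criterion, this rule is not guaranteed to be satisfied anywhere on the grid, so the sketch returns `None` when no grid value qualifies.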
III.  PROBLEM


The  problem  is  to  determine  the  probability 
distribution  of  K  given  Y.  It  is  proposed  to  determine 
this  by  attempting  to  derive  and  examine  the  functional 
relationship  of  Y  and  K. 

A.  ABSOLUTE  VALUE  CRITERION 

The criterion expressed by equation (9) may be stated,
by substituting from equation (4), as

$$ (X'X + kI)^{-1} X'Y - (X'X + I)^{-1} X'Y = \tau \qquad (12) $$

and by factoring

$$ [(X'X + kI)^{-1} - (X'X + I)^{-1}] X'Y = \tau \qquad (13) $$

but, as shown in Appendix C, equation (C-4), the expression
in brackets may be expanded to

$$ (X'X + kI)^{-1} [(X'X + I) - (X'X + kI)] (X'X + I)^{-1} \qquad (14) $$

Therefore, by canceling terms and simplifying, equation (13)
becomes

$$ (1 - k)(X'X + kI)^{-1} (X'X + I)^{-1} X'Y = \tau \qquad (15) $$

If $k \ne 1$ and if $(X'X + kI)^{-1}$ and $(X'X + I)^{-1}$ exist, then

$$ X'Y = \left( \frac{1}{1 - k} \right) (X'X + kI)(X'X + I)\tau \qquad (16) $$

The task then is to solve the linear equations in (16)
for Y in order to determine $R_k$. Unfortunately, equation (16)
represents p linear restraints (hyperplanes) on n unknown
variables where, in general, n > p. Furthermore, $\tau$ is a
function of Y. Thus, $R_k$ is not easily determined under
this criterion.
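The algebra leading from equation (13) to equations (15) and (16) can be checked numerically. The following is a minimal sketch, assuming two predictors in correlation form. The values of r, X'Y, and k and the helper names are hypothetical.

```python
def solve_shifted(r, rhs, k):
    """Solve (X'X + kI) x = rhs for X'X = [[1, r], [r, 1]]."""
    d = 1.0 + k
    det = d * d - r * r
    return [(d * rhs[0] - r * rhs[1]) / det, (d * rhs[1] - r * rhs[0]) / det]

def apply_shifted(r, v, k):
    """Multiply (X'X + kI) by the vector v."""
    return [(1.0 + k) * v[0] + r * v[1], r * v[0] + (1.0 + k) * v[1]]

r, g, k = 0.95, [0.9, 0.8], 0.3   # hypothetical correlation, X'Y, and ridge parameter

# Left hand side of (13): difference of the ridge estimates at k and at 1.
tau = [a - b for a, b in zip(solve_shifted(r, g, k), solve_shifted(r, g, 1.0))]

# Left hand side of (15): (1 - k) (X'X + kI)^{-1} (X'X + I)^{-1} X'Y.
inner = solve_shifted(r, g, 1.0)
tau_factored = [(1.0 - k) * v for v in solve_shifted(r, inner, k)]

# Equation (16): recover X'Y from tau.
recovered = [v / (1.0 - k) for v in apply_shifted(r, apply_shifted(r, tau, 1.0), k)]

print("tau from (13):       ", tau)
print("tau from (15):       ", tau_factored)
print("X'Y recovered by (16):", recovered)
```

The check recovers only the p elements of X'Y. The n elements of Y itself are not determined by these p equations, which is the difficulty noted in the text.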

B.  DERIVATIVE  CRITERION 

The criterion given by equation (11) may be stated by
substituting from equation (4)

$$ \frac{\partial}{\partial k} [(X'X + kI)^{-1} X'Y] = \pi \qquad (17) $$

or, since $\frac{\partial}{\partial k} (X'X + kI)^{-1} = -(X'X + kI)^{-2}$,

$$ -(X'X + kI)^{-2} X'Y = \pi \qquad (18) $$

Now, if (X'X + kI) is not singular then

$$ X'Y = (X'X + kI)^2 \pi \qquad (19) $$

where the negative sign has been dropped since the
criterion actually specifies the absolute value of the
components of the derivative and the notation of $\pi$ accounts
for proper signs.

Equation (19) is similar to equation (16), as it should
be since the criteria are similar, and the same difficulties
are encountered in determining $R_k$ as for the previous
criterion. In addition, the derivative of $\pi$ will be
difficult to determine. Therefore, the derivative
criterion does not lead to a useful result either.
IV.  NOTES ON THE FULL BAYESIAN RIDGE ESTIMATOR


The full Bayesian ridge estimator (FBRE) is suggested
by Eskew [Ref. 9] and is given as

$$ \hat{\beta}^* = (X'X + kI)^{-1} (X'Y + k\beta_0) \qquad (20) $$

where $\beta_0$ is a prior estimate of $\beta$. There are two interesting
properties of $\hat{\beta}^*$ not noted by Eskew.

First suppose that the prior $\beta_0$ is chosen to be the
OLS estimate $\hat{\beta}$. Then

$$ \hat{\beta}^* = (X'X + kI)^{-1} [X'Y + k(X'X)^{-1} X'Y] \qquad (21) $$

and hence

$$ \hat{\beta}^* = (X'X + kI)^{-1} [I + k(X'X)^{-1}] X'Y \qquad (22) $$

But

$$ [I + k(X'X)^{-1}] = (X'X + kI)(X'X)^{-1} \qquad (23) $$

Substituting (23) into (22)

$$ \hat{\beta}^* = (X'X)^{-1} X'Y = \hat{\beta} \qquad (24) $$
Thus if the OLS estimator is used as a prior estimate
for the FBRE, equation (20), then the resulting estimate is
equal to the OLS estimate.

Now, suppose that any prior estimate $\beta_0$ is used in
equation (20) but the resulting estimate is then used as a
prior in (20) to compute another estimate. If this
procedure is repeated indefinitely, in the limit the result
will again be the OLS estimator regardless of what prior,
$\beta_0$, was initially used. The proof of this is shown in
Appendix B.
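Both properties can be observed numerically. The following is a minimal sketch, assuming two predictors in correlation form. The inputs r, X'Y, and k, the starting prior, the iteration count, and the function names are hypothetical.

```python
def solve_shifted(r, rhs, k):
    """Solve (X'X + kI) x = rhs for X'X = [[1, r], [r, 1]]."""
    d = 1.0 + k
    det = d * d - r * r
    return [(d * rhs[0] - r * rhs[1]) / det, (d * rhs[1] - r * rhs[0]) / det]

def fbre(r, g, k, prior):
    """Equation (20): (X'X + kI)^{-1} (X'Y + k * prior)."""
    return solve_shifted(r, [g[0] + k * prior[0], g[1] + k * prior[1]], k)

r, g, k = 0.95, [0.9, 0.8], 0.1   # hypothetical correlation, X'Y, and ridge parameter
beta_ols = solve_shifted(r, g, 0.0)

# First property: an OLS prior returns the OLS estimate.
same_as_ols = fbre(r, g, k, beta_ols)

# Second property: feeding each estimate back in as the prior converges to OLS.
beta_m = [5.0, -3.0]              # an arbitrary starting prior
for _ in range(200):
    beta_m = fbre(r, g, k, beta_m)

print("OLS estimate:       ", beta_ols)
print("FBRE with OLS prior:", same_as_ols)
print("iterated FBRE:      ", beta_m)
```

The rate of convergence is governed by the largest value of $k/(\mu_i + k)$ over the eigenvalues $\mu_i$ of X'X, so the iteration slows as the smallest eigenvalue approaches zero.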
V.  CONCLUSIONS AND RECOMMENDATIONS


A.  CONCLUSIONS

The determination of a probability distribution of the
ridge estimator, $\hat{\beta}^*$, is desirable in order to facilitate
the use of hypothesis tests and the computation of
confidence bounds concerning $\hat{\beta}^*$. The probability distribution
of $\hat{\beta}^*$ depends on the objective rule used to select the
ridge parameter, k. Neither of the two objective rules
examined here appears to lead to a simply determined
probability distribution.

B.  RECOMMENDATIONS 

The search for a useful probability distribution of
$\hat{\beta}^*$ should be pursued further. In particular, the closed
form solution for k presented by Hemmerle [Ref. 7] may
prove fruitful. Other possibilities include investigating
other criteria based on the ridge trace such as minimizing
the sum of squares, over all i = 1, 2, ..., p, of the
difference between $\hat{\beta}^*_i(k)$ and $\hat{\beta}^*_i(1)$. Also, the same
criteria applied to the ridge trace could be considered
for the squared length of $\hat{\beta}^*$.
APPENDIX A


DERIVATION OF THE RIDGE REGRESSION ESTIMATOR

The residual sum of squares for any estimator $\hat{\beta}$ can be
written

$$ \Phi = (Y - X\hat{\beta})'(Y - X\hat{\beta}) = \varepsilon'\varepsilon \qquad (A-1) $$

In ridge regression it is desirable to minimize the
residual sum of squares subject to an acceptable length,
c, of the regression vector $\hat{\beta}^*$. Expressed as a Lagrangian
restraint problem this is

$$ \min \Phi'(\hat{\beta}^*) = (Y - X\hat{\beta}^*)'(Y - X\hat{\beta}^*) + k(\hat{\beta}^{*\prime}\hat{\beta}^* - c) \qquad (A-2) $$

where k is the Lagrangian multiplier.

Taking partial derivatives of $\Phi'$ with respect to $\hat{\beta}^*$
and setting them equal to zero

$$ \frac{\partial \Phi'}{\partial \hat{\beta}^*} = 0 = \frac{\partial}{\partial \hat{\beta}^*} [Y'Y - Y'X\hat{\beta}^* - \hat{\beta}^{*\prime}X'Y + \hat{\beta}^{*\prime}X'X\hat{\beta}^* + k\hat{\beta}^{*\prime}\hat{\beta}^* - kc] \qquad (A-3) $$

Hence

$$ 0 = -(Y'X)' - X'Y + 2X'X\hat{\beta}^* + 2k\hat{\beta}^* \qquad (A-4) $$

or

$$ 2X'Y = 2X'X\hat{\beta}^* + 2kI\hat{\beta}^* \qquad (A-5) $$

Therefore,

$$ X'Y = (X'X + kI)\hat{\beta}^* \qquad (A-6) $$

Now, if (X'X + kI) is non-singular (which k is selected
to ensure), then

$$ \hat{\beta}^* = (X'X + kI)^{-1} X'Y \qquad (A-7) $$
APPENDIX B


FULL BAYESIAN RIDGE ESTIMATION


A.  BACKGROUND

Eskew [Ref. 9] points out that ridge estimation is
equivalent to minimizing the squared differences between
the regression estimates and a prior estimate of zero
subject to a constraint on the sum of squares and suggests
that a non-zero prior might be more reasonable. Following
this line of reasoning he derives the full Bayesian ridge
estimator (FBRE)

$$ \hat{\beta}^* = (X'X + kI)^{-1} (X'Y + k\beta_0) \qquad (B-1) $$

where $\beta_0$ is a prior estimate of the true parameters $\beta$.
Note that the ridge estimator is a special case of FBRE
where the prior is taken to be zero.

Eskew  shows  that  the  variance  of  the  FBRE  is  the  same 
as  the  variance  of  the  ridge  regression  estimator  (RRE) 
while  the  squared  bias  of  the  FBRE  is  less  than  that  for 
the  RRE,  thereby  resulting  in  a  reduction  of  mean  squared 
error. 

B.  ITERATIVE USE OF THE FULL BAYESIAN RIDGE ESTIMATOR

Suppose that the FBRE is calculated using any prior,
$\beta_0$, and then the result, $\hat{\beta}^*_1$, is used as a prior to
calculate another FBRE, $\hat{\beta}^*_2$. If this procedure is repeated
m times the result may be written

$$ \hat{\beta}^*_m = (1/k) \sum_{i=1}^{m} (kA)^i X'Y + (kA)^m \beta_0 \qquad (B-2) $$

where $A = (X'X + kI)^{-1}$. It is interesting to determine the
form of $\hat{\beta}^*_m$ in the limit as m approaches infinity. Since A
and X'X are positive definite matrices their eigenvalues
are positive. Let $\lambda_i > 0$ be an eigenvalue of A and $\mu_i > 0$
be an eigenvalue of X'X. Hoerl and Kennard show the
relationship between $\lambda_i$ and $\mu_i$ to be

$$ \lambda_i = 1/(\mu_i + k) \qquad (B-3) $$

Now there exists an orthogonal p x p matrix P with P'P = I
such that

$$ P'AP = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) \qquad (B-4) $$

or, since the eigenvalues of kA are $k\lambda_i$ and the eigenvalues
of $A^m$ are $\lambda_i^m$,

$$ P'(kA)^m P = \mathrm{diag}(k^m\lambda_1^m, k^m\lambda_2^m, \ldots, k^m\lambda_p^m) \qquad (B-5) $$

Now

$$ P' [\lim_{m \to \infty} (kA)^m] P = \lim_{m \to \infty} P'(kA)^m P \qquad (B-6) $$
The right hand side of (B-6) is the limit of the right hand
side of (B-5). By substituting from equation (B-3) a
typical diagonal element is $0 < [k/(\mu_i + k)]^m < 1$, since
$\mu_i > 0$ for all i = 1, 2, ..., p. Therefore, each of the
elements of the right hand side of (B-5) approaches zero
as m approaches infinity. Hence

$$ P' [\lim_{m \to \infty} (kA)^m] P = 0 \qquad (B-7) $$

This can only occur if

$$ \lim_{m \to \infty} (kA)^m = 0 \qquad (B-8) $$

Therefore, the last term of equation (B-2) is zero in the
limit. Now define a matrix function S = S(kA) where

$$ S = \sum_{i=1}^{\infty} (kA)^i \qquad (B-9) $$

DeRusso, Roy, and Close [Ref. 10] show that S(kA) converges
if and only if $S(k\lambda_i)$ converges for all $k\lambda_i$, the eigenvalues
of kA. Clearly this will occur if and only if

$$ |k\lambda_i| < 1 \qquad (B-10) $$

Substituting equation (B-3)

$$ |k/(\mu_i + k)| < 1 \qquad (B-11) $$

or, after some algebra

(B-12)

Since $\mu_i$ is an eigenvalue of a positive definite matrix,
X'X, then $\mu_i > 0$ and both conditions of (B-12) are met.
Therefore S(kA) does converge. To see what it converges to,
define S' = S + I and multiply S' on the left by (I - kA)

$$ (I - kA)S' = (I - kA)(I + kA + (kA)^2 + \cdots) \qquad (B-13) $$

and multiplying the right hand side out

$$ (I - kA)S' = [I + kA + (kA)^2 + \cdots] - [kA + (kA)^2 + \cdots] = I \qquad (B-14) $$

Then

$$ S' = (I - kA)^{-1} \qquad (B-15) $$
Then

$$ S = \{[(kA)^{-1} - I] kA\}^{-1} - I = (1/k)A^{-1} [(1/k)A^{-1} - I]^{-1} - I \qquad (B-16) $$

Substituting $A = (X'X + kI)^{-1}$

$$ S = [(1/k)X'X + I] [(1/k)X'X]^{-1} - I = k(X'X)^{-1} \qquad (B-17) $$

Substituting S into equation (B-2)

$$ \lim_{m \to \infty} \hat{\beta}^*_m = (1/k) k (X'X)^{-1} X'Y \qquad (B-18) $$

Therefore

$$ \lim_{m \to \infty} \hat{\beta}^*_m = (X'X)^{-1} X'Y = \hat{\beta} \qquad (B-19) $$

Thus the iterative procedure, starting with any prior $\beta_0$,
converges to the OLS estimator, $\hat{\beta}$.
APPENDIX C


MISCELLANEOUS MATRIX ALGEBRA AND CALCULUS

Let A, B, and C denote nonsingular n x n matrices. Denote their
inverses by $A^{-1}$, $B^{-1}$ and $C^{-1}$, respectively.

A.  MATRIX ALGEBRA

First, note that

$$ C(A + B)^{-1} = (AC^{-1} + BC^{-1})^{-1} \qquad (C-1) $$

since

$$ C(A + B)^{-1} = [(A + B)C^{-1}]^{-1} \qquad (C-2) $$

$$ = (AC^{-1} + BC^{-1})^{-1} \qquad (C-3) $$

Also

$$ A^{-1} \pm B^{-1} = A^{-1}(B \pm A)B^{-1} = B^{-1}(B \pm A)A^{-1} \qquad (C-4) $$

since

$$ A^{-1}(B \pm A)B^{-1} = (A^{-1}B \pm I)B^{-1} = (A^{-1} \pm B^{-1}) \qquad (C-5) $$

and

$$ B^{-1}(B \pm A)A^{-1} = (I \pm B^{-1}A)A^{-1} = (A^{-1} \pm B^{-1}) \qquad (C-6) $$

B.  MATRIX  CALCULUS 

Let A(t), B(t), and C(t) denote m x n matrices whose
elements may be functions of the scalar variable t. Let
$\dot{A}(t)$ and $\dot{B}(t)$ denote the derivatives of A(t) and B(t),
respectively, with respect to t.

The following are shown to be true by DeRusso, Roy,
and Close [Ref. 10].

$$ \frac{d}{dt} [A(t)B(t)] = \dot{A}(t)B(t) + A(t)\dot{B}(t) \qquad (C-7) $$


and, for nonsingular square A(t),

$$ \frac{d}{dt} A^{-1}(t) = -A^{-1}(t) \dot{A}(t) A^{-1}(t) \qquad (C-8) $$